docs(longhaul): add long-haul test design document#400

Open
WentingWu666666 wants to merge 8 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/longhaul-design-doc

@WentingWu666666 WentingWu666666 commented Jun 10, 2026

Collaborator

Part 1/5 of #348 split.

Scope

Adds docs/designs/long-haul-test-design.md (367 lines, new file).

Content

  • Goals and non-goals for the long-haul test
  • Architecture diagram (writer/verifier loop, operations scheduler, monitor, journal, report)
  • Data plane invariants: majority writes, gap detection, checksum validation
  • Failure modes and disruption window policy
  • HA-gated upgrade scenario, spec.instancesPerNode scaling
  • Relationship to test/e2e
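
As a rough illustration of the gap-detection and checksum-validation invariants listed above, a verifier pass might look like the following Go sketch. The `Doc` type and the `checksum` and `verify` functions are hypothetical names, not taken from the design doc or the driver, and SHA-256 is an assumed checksum choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Doc is a hypothetical shape for a written document: a sequence number,
// a payload, and the checksum the writer stored alongside it.
type Doc struct {
	Seq      int64
	Payload  string
	Checksum string
}

// checksum returns the hex-encoded SHA-256 of the payload (assumed algorithm).
func checksum(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// verify checks the documents read back against the highest acknowledged
// sequence number. It returns the missing sequence numbers (gaps) and the
// sequence numbers whose stored checksum does not match the payload.
func verify(docs []Doc, maxAcked int64) (gaps, corrupt []int64) {
	seen := make(map[int64]bool, len(docs))
	for _, d := range docs {
		seen[d.Seq] = true
		if checksum(d.Payload) != d.Checksum {
			corrupt = append(corrupt, d.Seq)
		}
	}
	for seq := int64(1); seq <= maxAcked; seq++ {
		if !seen[seq] {
			gaps = append(gaps, seq)
		}
	}
	return gaps, corrupt
}

func main() {
	docs := []Doc{
		{Seq: 1, Payload: "a", Checksum: checksum("a")},
		{Seq: 2, Payload: "b", Checksum: checksum("b")},
		{Seq: 4, Payload: "d", Checksum: "bad"},
	}
	gaps, corrupt := verify(docs, 4)
	fmt.Println("gaps:", gaps, "corrupt:", corrupt) // gaps: [3] corrupt: [4]
}
```

The sketch treats any acknowledged sequence number that cannot be read back as a gap; how the real verifier scopes its scan and handles in-flight writes is not specified here.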

Verification

Docs-only; no build/test impact.

Related

Splits #348 into 5 focused PRs:

Copilot AI left a comment

Contributor

Pull request overview

This PR adds (or refreshes) the long-haul (canary) test driver for the DocumentDB Kubernetes Operator: a standalone Go module under test/longhaul/ with writers/verifiers, disruption-window journaling, a weighted-random operations scheduler (scale + DocumentDB upgrade), health/leak monitoring, periodic reporting to a longhaul-report ConfigMap, plus in-cluster Deployment packaging and GitHub Actions workflows (build/deploy/monitor). It also updates the long-haul design document and supporting README/manifests.

Changes:

  • Introduces the long-haul test driver Go module (test/longhaul/) with workload, monitor, operations, journal, and reporting components.
  • Adds Kubernetes deployment artifacts (Deployment + RBAC + setup manifest) and CI workflows to build, deploy, and monitor the long-haul canary.
  • Updates docs/designs/long-haul-test-design.md to describe the architecture, invariants, and operations catalog.

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 24 comments.

| File | Description |
| --- | --- |
| test/longhaul/workload/writer.go | Writer loop that inserts majority-acknowledged documents with checksum/seq tracking. |
| test/longhaul/workload/verifier.go | Periodic verifier scanning for gaps and checksum mismatches under majority read concern. |
| test/longhaul/workload/metrics.go | Atomic counters + snapshot helpers for workload metrics. |
| test/longhaul/report/suite_test.go | Ginkgo suite bootstrap for report package tests. |
| test/longhaul/report/report.go | Markdown report generator for long-haul run state. |
| test/longhaul/report/checkpoint.go | Periodic reporter writing stdout + longhaul-report ConfigMap. |
| test/longhaul/report/checkpoint_test.go | Unit tests for ConfigMap create/update/result-field behavior. |
| test/longhaul/report/alert.go | GitHub Actions annotations for pass/fail/leak warnings. |
| test/longhaul/README.md | Driver usage, deployment instructions, and config reference. |
| test/longhaul/operations/upgrade.go | DocumentDB version upgrade operation using desired-version ConfigMap + steady-state gate. |
| test/longhaul/operations/suite_test.go | Ginkgo suite bootstrap for operations tests. |
| test/longhaul/operations/scheduler.go | Weighted-random scheduler with cooldown + steady-state gating + disruption windows. |
| test/longhaul/operations/scheduler_test.go | Unit tests for weighted selection + cooldown short-circuiting. |
| test/longhaul/operations/scale.go | Scale up/down operations with patch-confirmation polling and outage policies. |
| test/longhaul/monitor/suite_test.go | Ginkgo suite bootstrap for monitor tests. |
| test/longhaul/monitor/leakdetect.go | Linear-regression leak detector over sampled memory/CPU. |
| test/longhaul/monitor/k8sclient.go | Real Kubernetes client implementation (pods/CR/metrics, CR patching). |
| test/longhaul/monitor/health.go | Health monitor with steady-state tracking and recovery waits. |
| test/longhaul/monitor/health_test.go | Unit tests for steady-state and wait semantics using a fake ClusterClient. |
| test/longhaul/journal/suite_test.go | Ginkgo suite bootstrap for journal tests. |
| test/longhaul/journal/policy.go | Outage policy + disruption window evaluation logic. |
| test/longhaul/journal/policy_test.go | Unit tests pinning boundary behavior of the verdict oracle. |
| test/longhaul/journal/journal.go | Thread-safe append-only journal + disruption-window tracking. |
| test/longhaul/journal/journal_test.go | Unit tests for journal behavior and concurrency safety. |
| test/longhaul/go.sum | Dependency lockfile for the long-haul module. |
| test/longhaul/go.mod | New standalone Go module for the long-haul driver. |
| test/longhaul/Dockerfile | Multi-stage container build for the long-haul binary. |
| test/longhaul/deploy/setup.yaml | Namespace + DocumentDB CR bootstrap manifest for the canary cluster. |
| test/longhaul/deploy/rbac.yaml | ServiceAccount + Role/Bindings + metrics ClusterRole for the driver. |
| test/longhaul/deploy/deployment.yaml | ConfigMap-driven Deployment manifest for in-cluster execution. |
| test/longhaul/config/suite_test.go | Ginkgo suite bootstrap for config tests. |
| test/longhaul/config/config.go | Env-driven config loading + validation for the driver. |
| test/longhaul/config/config_test.go | Unit tests for env parsing, validation, and enable flag parsing. |
| test/longhaul/cmd/longhaul/main.go | Standalone binary wiring: Mongo workload, ops scheduler, monitoring, reporting. |
| docs/designs/long-haul-test-design.md | Updated design doc describing architecture, invariants, and phases. |
| .github/workflows/longhaul-monitor.yaml | Hourly monitor workflow for Deployment health/report staleness + version publishing. |
| .github/workflows/longhaul-image-build.yml | Workflow to build/push the long-haul driver image to GHCR. |
| .github/workflows/longhaul-deploy.yml | Workflow to roll the driver Deployment on AKS (manual + workflow_run). |
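
The file summary above describes `leakdetect.go` as a linear-regression leak detector over sampled memory/CPU. A minimal sketch of that idea, assuming a plain least-squares slope compared against a growth threshold, could look like this. The `Sample` type, the `slope` and `leaking` functions, and the MiB-per-hour unit are all hypothetical and are not the driver's actual API.

```go
package main

import "fmt"

// Sample is one hypothetical memory reading: hours since start, RSS in MiB.
type Sample struct {
	Hours  float64
	RSSMiB float64
}

// slope returns the least-squares slope (MiB per hour) of RSS over time.
// It returns 0 when there are fewer than two samples or when all samples
// share the same timestamp, since no trend can be fitted in either case.
func slope(samples []Sample) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		sumX += s.Hours
		sumY += s.RSSMiB
		sumXY += s.Hours * s.RSSMiB
		sumXX += s.Hours * s.Hours
	}
	den := n*sumXX - sumX*sumX
	if den == 0 {
		return 0
	}
	return (n*sumXY - sumX*sumY) / den
}

// leaking flags a series whose fitted growth rate exceeds the threshold.
func leaking(samples []Sample, thresholdMiBPerHour float64) bool {
	return slope(samples) > thresholdMiBPerHour
}

func main() {
	samples := []Sample{{0, 100}, {1, 102}, {2, 104}, {3, 106}}
	fmt.Printf("slope=%.1f MiB/h leaking=%v\n", slope(samples), leaking(samples, 1.0))
	// prints: slope=2.0 MiB/h leaking=true
}
```

Fitting a trend over many samples, rather than comparing the first and last readings, makes the check less sensitive to a single spike such as a garbage-collection pause or a burst of reconciles.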

Comment thread test/longhaul/monitor/k8sclient.go Outdated
Comment thread test/longhaul/monitor/k8sclient.go Outdated
Comment thread test/longhaul/operations/scale.go Outdated
Comment thread test/longhaul/operations/scale.go Outdated
Comment thread test/longhaul/config/config.go Outdated
Comment thread test/longhaul/cmd/longhaul/main.go Outdated
Comment thread test/longhaul/cmd/longhaul/main.go Outdated
Comment thread test/longhaul/monitor/k8sclient.go Outdated
Comment thread test/longhaul/workload/writer.go Outdated
Comment thread test/longhaul/config/config.go Outdated
@documentdb-triage-tool documentdb-triage-tool Bot added the CI/CD, dependencies, documentation, and test labels Jun 10, 2026
@documentdb-triage-tool

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, CI/CD, documentation, dependencies
Project fields suggested: Component test · Priority P3 · Effort XL · Status Needs Review
Confidence: 0.97 (mixed)

Reasoning

component from path globs (test, ci, docs, dependencies); effort from diff stats (5023+0 LOC, 38 files); LLM: Single-file docs-only update to a design document with no build or test impact, part of a larger split PR series.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from 15eb6f4 to ff2c1cb Compare June 10, 2026 14:00
@WentingWu666666 WentingWu666666 changed the title docs(longhaul): update long-haul test design document (1/5 of #348) docs(longhaul): add long-haul test design document (1/5 of #348) Jun 10, 2026
@WentingWu666666 WentingWu666666 marked this pull request as draft June 10, 2026 14:30
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch 13 times, most recently from 031d0a5 to f848d41 Compare June 10, 2026 15:14
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from f848d41 to fc797a3 Compare June 10, 2026 15:14
Adds the design doc covering goals, architecture (writer/verifier loop,
operations scheduler, monitor, journal, report), data plane invariants
(majority writes, gap detection, checksum validation), failure modes,
and relationship to test/e2e.

Split from documentdb#348 as a standalone reviewable PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from fc797a3 to bdc8cf0 Compare June 10, 2026 15:18
@WentingWu666666 WentingWu666666 changed the title docs(longhaul): add long-haul test design document (1/5 of #348) docs(longhaul): add long-haul test design document Jun 10, 2026
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 10, 2026 15:20
Comment thread docs/designs/long-haul-test-design.md
…eate)

Address @hossain-rayhan feedback on documentdb#400: the doc didn't make explicit
that a Fatal failure preserves the cluster for post-mortem rather than
auto-recreating it, and that recovery is manually triggered after a
maintainer reviews the alert from the monitor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
| **Journal** | In-process append-only event log shared by all components. | Reproducible event stream for the report. |
| **Report** | Aggregates the journal into a markdown summary at a configurable interval; raises alerts on threshold breaches. | Markdown report; alert lines. |

### Cluster Topology

Collaborator

we should mention that where possible we want to reuse the code from the e2e tests (e.g. client)

Collaborator Author

Added a "Code reuse" paragraph at the end of the Architecture section in ab9a009 (will update SHA in reply if push lands differently):

Where possible, the driver consumes the same helpers as the e2e suite — the Mongo client, DocumentDB lifecycle operations (create / patch / wait-healthy / delete), and TLS plumbing all live in a shared test/shared Go module. This keeps long-haul behavior aligned with what e2e exercises and avoids two diverging mongo-driver wrappers.

That test/shared module is what #401 extracts — long-haul will consume it from day one.


## Lifecycle

The test runs **continuously** — no cycles, no resets. Workload, metrics, operations, and health monitoring all run as long-lived processes. The system accumulates real state (PVC growth, CR history, operator memory) exactly as it would in production.

Collaborator

Please elaborate on how we deal with different versions, e.g. are we always running the latest, the current, etc.? When are we updating? Is that part of the test?

Is there a point when we start over? What's the criteria?


The test runs **continuously** — no cycles, no resets. Workload, metrics, operations, and health monitoring all run as long-lived processes. The system accumulates real state (PVC growth, CR history, operator memory) exactly as it would in production.

**Workload runs through upgrades.** No drain, no quiesce. Draining before upgrade hides exactly the upgrade-under-state bugs we're testing.

Collaborator

Are we also downgrading? Or are we starting over at some point so we can test upgrade more than once?

| **Lifecycle** | DocumentDB version upgrade, operator upgrade |
| **HA** | controlled failover |
| **Chaos** | kill primary pod, drain node |
| **Data protection** | trigger backup, verify backup |

Collaborator

Do we have operator upgrades as well?
Operator chaos?
Remote nodes/multi-region? (Maybe not now, but potentially planned for the future.)

Comment thread docs/designs/long-haul-test-design.md Outdated
- One disruptive op at a time. Overlapping disruptions are non-diagnosable.
- Per-category cooldown between ops. Lets the cluster stabilize.
- Steady-state gate — health check must pass before the next op fires.
- Backup isolation — no topology changes during backup.

Collaborator

Why not? Backup should block/delay; this should be handled by the backup feature.

Collaborator Author

Agreed — dropped the bullet (d1694f5). Backup-vs-topology is the backup feature's job; isolating it in the harness would hide exactly the bugs we want to catch.
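
With the backup bullet dropped, the three remaining scheduling rules quoted above (one disruptive op at a time, per-category cooldown, steady-state gate) can be read as a single admission check. The Go sketch below is only an illustration under that reading. The `Gate` type and its `MarkDone` and `CanFire` methods are hypothetical and do not reflect the scheduler's real code.

```go
package main

import (
	"fmt"
	"time"
)

// Gate holds the hypothetical state a scheduler consults before firing an op.
type Gate struct {
	InFlight bool                 // true while a disruptive op is running
	Cooldown time.Duration        // minimum gap between ops of one category
	LastRun  map[string]time.Time // category -> time its last op finished
	Healthy  func() bool          // steady-state health check
}

// MarkDone records that an op in the category finished at the given time
// and clears the in-flight flag.
func (g *Gate) MarkDone(category string, at time.Time) {
	if g.LastRun == nil {
		g.LastRun = make(map[string]time.Time)
	}
	g.LastRun[category] = at
	g.InFlight = false
}

// CanFire reports whether an op in the category may start at now,
// together with the reason when it may not.
func (g *Gate) CanFire(category string, now time.Time) (bool, string) {
	if g.InFlight {
		return false, "another disruptive op is in flight"
	}
	if last, ok := g.LastRun[category]; ok && now.Sub(last) < g.Cooldown {
		return false, "category is cooling down"
	}
	if !g.Healthy() {
		return false, "cluster is not in steady state"
	}
	return true, ""
}

func main() {
	now := time.Now()
	g := &Gate{Cooldown: 10 * time.Minute, Healthy: func() bool { return true }}
	g.MarkDone("scale", now.Add(-2*time.Minute))

	ok, reason := g.CanFire("scale", now)
	fmt.Println(ok, reason) // blocked: the scale category finished 2 minutes ago
	ok, reason = g.CanFire("upgrade", now)
	fmt.Println(ok, reason) // allowed: no recent op in the upgrade category
}
```

Note that the sketch has no backup-specific rule, matching the decision in this thread to leave backup-versus-topology serialization to the backup feature.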

Comment thread docs/designs/long-haul-test-design.md
**Per-component attribution.** Metrics are tagged by component (operator pod RSS, DB pod RSS, goroutine count, reconcile rate, API-call rate). Without separate series, a memory climb at hour 30 is undiagnosable.

**Human-in-the-loop alerts.** The hourly monitor posts a summary to the workflow run and, when configured, to a chat channel. A maintainer reviews the evidence and manually creates a GitHub issue. No auto-filed issues — alert fatigue from transient or infrastructure failures would erode trust in the canary.

Collaborator

We should also record the system dashboard metrics (latency, uptime, etc.) as well as the logs of all components for later analysis. Where do we keep them?

Collaborator Author

Added an Artifact Retention subsection in 314f3dc. Two tiers:

  • Rolling status: longhaul-report ConfigMap polled by the monitor workflow (this part is already used in the existing driver).
  • Forensics bundle: pod logs, events, CR snapshots, metric samples, journal, uploaded as a GitHub Actions artifact on every Tier-1 / Tier-2 alert and at end of run.

Operational details (which collectors, sanitization rules, bundle layout) are kept in test/longhaul/README.md rather than the design doc, so the design stays high-level.

| **CloudNative-PG** | Failover via pod delete + SIGSTOP; pod-level resource sampling | Ginkgo framework (we use a long-lived `Deployment` instead) |
| **CockroachDB** | Chaos runner; separate workload from disruption; roachstress | Custom roachtest framework (too heavy) |
| **Vitess** | Background stress goroutine; per-query tracking | No fault injection (we need disruptive ops) |

Collaborator

We are also interested in how FoundationDB tests (they turned their approach into Antithesis); not sure if they cover long-haul, though.

Collaborator Author

Added FoundationDB and Antithesis as separate rows in the Learnings table (309aefc). Short answer to your question: neither covers long-haul — both run in simulated time on fake network/disk, so they catch rare-interleaving logic bugs in seconds but can't surface the wall-clock accumulation bugs (mem leaks, lock-table bloat, CR-history drift) that need real reconciliation cycles over real days. We adopt their property-based oracle and workload/fault separation.

Comment thread docs/designs/long-haul-test-design.md Outdated

## Open Questions

1. Multi-region canary scope — AKS Fleet integration?

Collaborator

yes, at a later point

Collaborator Author

Agreed — renamed Open Questions -> Future Scope and reworded the multi-region item so it reads as explicitly deferred (still a candidate before GA if scope allows). See 8a79edd.

Copilot AI added 6 commits June 15, 2026 12:53
Per @xgerman feedback, call out that both Primary and Baseline run
with production-style podAntiAffinity and a PodDisruptionBudget so
chaos and upgrade operations exercise operator/DB bugs rather than
misconfiguration failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman feedback, state explicitly that the driver reuses e2e
helpers (Mongo client, DocumentDB lifecycle ops, TLS plumbing) from
the shared test/shared Go module rather than forking them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, multi-region (AKS Fleet) is deferred. Rename the
Open Questions section to Future Scope and reword the item so the
deferred status is explicit; no open design questions remain.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, call out FDB Simulation and Antithesis. Both are
deterministic simulation tools that target rare-interleaving logic
bugs in simulated time; they don't cover the wall-clock accumulation
bugs that long-haul exists for. We adopt their property-based oracle
and workload/fault separation, not the simulation engine itself.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, spell out where evidence is kept. Two tiers: a rolling
status summary in the longhaul-report ConfigMap (already used by the
monitor workflow), and a forensics bundle uploaded as a GitHub Actions
artifact on alert and at end of run. Operational details (collectors,
sanitization, layout) belong in test/longhaul/README.md, not the design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @xgerman, isolating backup from topology hides exactly the
serialization bugs long-haul should catch. Backup-vs-topology is the
backup feature's responsibility, not the harness's.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

CI/CD, dependencies, documentation, test

5 participants